Reject unresolvable unit symbols instead of dropping them#30
Merged
Conversation
The lexer built its symbol character-class only from registered glyphs, so any other non-ASCII glyph matched no token and String#scan silently skipped it. Unknown ASCII words still matched the baseline class and reached validation, but unknown glyphs vanished: Unit(1, 'λ') returned a unitless 1, Unit(1, 'm λ') silently became 1 m, and Unit(1, 'm/λ') raised a misleading 'Unexpected token /'. Tokenize any run of non-operator, non-whitespace characters as a symbol so every unrecognized glyph survives lexing and fails loudly as 'Undefined unit <symbol>' via the same validation path unknown ASCII symbols already used, naming the real culprit. This also makes the tokenizer independent of what is registered, so it is a plain constant again rather than a per-load rebuild. Add specs covering unknown ASCII symbols, unregistered glyphs (alone, implicitly multiplied, and operator-adjacent), and a unit whose defining system has not been loaded yet.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
String#scansilently skipped it. Unknown ASCII words still matched the baseline[A-Za-z0-9_]+class and reached validation, but unknown glyphs vanished:Unit(1, 'λ')returned a unitless1Unit(1, 'm λ')silently became1 mUnit(1, 'm/λ')raised a misleadingSyntaxError: Unexpected token /Undefined unit <symbol>through the same validation path unknown ASCII symbols already used — and the error names the real culprit (Undefined unit λ) rather than a neighboring operator.Ω,°C,µ,%, units loaded at runtime) still parse, andto_sround-trips are preserved.Breaking: parsing an unrecognized unit symbol now always raises
TypeError: Undefined unit …instead of sometimes silently yielding a dimensionless value. This extends the contract that unknown ASCII symbols already followed to unknown glyphs of any alphabet.Test plan
bundle exec rake— both phases green (140 no-DSL + 142 DSL examples, 0 failures)